Indexing in the SAS System , Version 6 ®

نویسندگان

  • Denise J. Moorman
  • Deanna Warner
چکیده

The purpose of this article is to explain indexing in the SAS System. Questions and answers are provided to help ® you understand what an index is and how one works, when and how to create an index, when SAS uses an index, how an index is maintained, and what you can do to enhance the performance of an index. Content The following topics are discussed: • Introduction • Understanding the Index file • Deciding Whether to Create an Index • Creating an Index • Using an Index for WHERE Processing • Using an Index with Views • Using an Index for BY Processing • Using an Index with the KEY= Option • Maintaining Indexes • Enhancing Performance • Conclusion • References. Introduction Indexing became available in Version 6 of the SAS System to improve performance of SAS applications. An index, like any tool, is more beneficial when understood. This first set of questions provides some general information about SAS indexes. Question: What is an index? Answer: An index is an optional file that you can create for a SAS data set to provide quicker and more efficient access to observations. The index stores values in ascending value order for a specific variable or variables and includes information as to the location of those values within observations in the data set. That is, an index provides the ability to locate an observation by value. When creating an index, you designate which variable or variables to index; an indexed variable is called a key variable. Question: When does the SAS System use an index? Answer: In general, SAS can use an index to improve performance in the following situations: • For WHERE processing, an index can provide faster and more efficient access to a subset of data. Note that to process a WHERE expression, SAS decides whether to use the index or read the entire data set sequentially. • For BY processing, an index returns observations in the index order even when the data set is not stored in that order. • For the SET and MODIFY statements, the KEY= option allows you to specify an index in a DATA step to retrieve particular observations in a data set. Question: How does the SAS System use an index? Answer: SAS uses an index to directly access a value. For example, suppose you want the observation with SSN (social security number) equal to 465-33-8613: • Without an index, SAS accesses observations sequentially in the order they are stored in the data set. SAS reads each observation one after another, looking for SSN=465-33-8613 until all observations are read. C With an index for the variable SSN, SAS accesses the observation directly. SAS satisfies the condition using the index and goes straight to the observation containing the value without having to read each observation in the data set. Question: What type of indexes are available? Answer: There are two types of indexes: a simple index and a composite index. A data set can have multiple simple indexes, composite indexes, or both. • A simple index is an index of values for one key variable. SAS automatically names the index the same name as the variable. • A composite index is an index of two or more key variables. The values from the variables are concatenated to form a single value. When you create a composite index, you specify a name for the index. In addition to deciding whether you want a simple index or a composite index, you can also: C limit an index (and its data set) to unique values by specifying the UNIQUE option when you create an index. This prevents duplicate values from being stored in both the data set and the index. C exclude missing values from the index. Missing values can be prohibited from being stored in the index by specifying the NOMISS option when you create an index. However, the missing values are permitted to exist in the data set. Understanding the Index File The index file is a SAS file, which has the same name as its associated data set and a member type of INDEX. There is only one index file per data set...all indexes for a data set are stored in a single file. Question: Are indexes stored in the same physical file as the data set? Answer: For the MVS operating system, the index file and data set reside in the same physical location. However, in other operating environments, the index file is a separate file. In any case, the index file is stored in the same SAS library as its data set. Question: What is the internal structure of an index file? Answer: The SAS System creates a binary tree (B-Tree) hierarchical structure consisting of entries that represent each distinct value for a key variable. The entries are in ascending value order. Each entry consists of a distinct value and one or more unique record identifiers (referred to as a RID) that identifies each observation containing the value. You can think of the RID as an internal observation number. For example, the following table represents index entries for the variable LASTNAME: Avery 10 Brown 6,22,43 Craig 5,50 Dunn 1 SAS automatically keeps the index file balanced as updates are made, resulting in a uniform resource cost to access any index entry. All space occupied by deleted values is recovered and reused. Question: Are there any other internal structures of an index file besides that of a B-tree? Answer: Yes. On special hardware, a bitmap index is used. Refer to Scalable Performance Data Server User’s TM Guide, Version 1, First Edition. Question: How is the index file read? Is the entire index file loaded to memory when opened? Answer: When an index is used to process a request, such as a WHERE expression, SAS does a binary search on the index file and positions the index to the first entry that contains a qualified value. SAS then uses the value’s RID or RIDs to read the observations containing the value. The entire index file is not loaded to memory; one index page is accessed at a time. The most recent index page is kept in memory. Question: Extra disk space is required to store the index file. Is there a way to calculate the amount of disk space needed? Answer: To calculate disk space required for an index file, you can use a formula provided in the 1989 SUGI Proceedings, “Using New SAS Database Features and Options” on page 343. This formula is appropriate for all operating environments and both simple and composite indexes. The formula can be accessed on the SAS Institute World Wide Web at the following URL: www.sas.com/techsup/download/datastep/indexest.sas Caution: The formula is untested and should be used cautiously. Question: If a data set is encrypted using the ENCRYPT= data set option, is the index file also encrypted? Answer: Yes, the index file is also encrypted. Prior to Release 6.08 TSLEVEL 430, the utility files used to sort the values when creating an index were not encrypted. Deciding Whether to Create an Index Even though an index can reduce the time required to locate a set of observations, especially for a large data set, there are costs associated with creating, storing, and maintaining the index. When deciding whether to create an index, you must consider increased resource usage along with the performance improvement. Question: Can you provide some guidelines for determining whether to index? Answer: When considering whether to index a data set, use the following guidelines: C Create an index when you intend to retrieve a small subset of observations, for example, less than 25% of all observations. C For a small data set, sequential processing is often just as efficient as index processing. If the number of data set pages is less than three, then it is not beneficial to index any variable in the data set. To determine the number of data set pages, issue the CONTENTS procedure. Note that the available information from PROC CONTENTS depends on the operating environment. C Consider how often your applications will use an index. An index must be used often in order to make up for the resources used from creating and maintaining it. C Limit the number of indexes for a data set to reduce disk storage and to reduce update costs. C Do not create an index for a data set that is frequently changed. If you have a data set that changes often, the overhead associated with updating the index after each change can outweigh the processing advantages you gain from using the index. Question: Using the previous guidelines, my data set is a good candidate for indexing, but how do I decide which variable or variables to index? Answer: In most cases, multiple variables are used to query a data set; however, it would be a mistake to index all variables. You should index variables whose data is discriminate and uniformly distributed. Question: What is a discriminate variable? Answer: A key variable should be discriminating, which means that the variable can limit the number of observations returned. Select a key variable that tends to have more unique values for observations. For example, EMPID and LASTNAME are discriminating because it is less likely that many employees share the same employee number or last name. However, AGE, FRSTNAME, and GENDER are not discriminating because it is very possible for a large number of observations to have the same age, first name, and gender. Question: Why should a key variable have data that is uniformly distributed? Answer: Data that is uniformly distributed means that the number of observations are evenly distributed between the minimum and maximum values. Data distribution is important because of the way SAS decides whether to use an index to process a WHERE expression. When determining whether to use an index, SAS estimates the number of observations that will be qualified, then compares resource costs of using the index to sequentially reading the entire data set. To estimate the number of qualified observations, SAS assumes that the data is uniformly distributed. To determine data distribution for a variable, you can use PROC CHART or PROC UNIVARIATE. If the plot line is relatively smooth or flat, then the data is uniformly distributed. If the plot line portrays the classic bell curve, the data has a normal distribution and indexing would tend to be less effective. For more information on PROC CHART and PROC UNIVARIATE, refer to SAS Procedures Guide, Version 6, Third Edition. ® For example, assume you are considering indexing a variable that represents IQ values ranging from 55 to 145. Below is a PROC CHART on data set WORK.IQ for variable IQ:

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Compatibility of SAS Data Library

This paper presents an overview of backwards and forwards compatibility between Versions 6 and 7 of the SAS System and is intended to be used as a reference in determining what manner of access is allowed. Furthermore, this paper takes into account differences in compatibility between the Base SAS System and when products such as SAS/CONNECT and SAS/SHARE are used. SAS I/O Services The I/O Se...

متن کامل

Performance Issues to Consider When Migrating Production Jobs from Version 5 to Version 6 of the SAS® System under MVS

This paper deals with performance issues when migrating production jobs from Version 5 to Version 6 of the SAS® System under MVS. Due to the significant differences between the architectures of Version 6 and Version 5 of the SAS System, the user might experience performance degradation/improvement when migrating some jobs to Version 6 of the SAS System. Specific areas to consider are formats an...

متن کامل

A Feasibility Study of IMRT of Lung Cancer Using Gafchromic EBT3 Film

Background: Intensity modulated radiation therapy (IMRT) is an advanced method for delivery of three dimensional therapies, which provides optimal dose distribution with giving multiple nonuniform fluency to the patient. The complex dose distribution of IMRT should be checked to ensure that the accurate dose is delivered. Today, film dosimetry is a powerful tool for radiotherapy treatment Quali...

متن کامل

Qualitative Risk Assessment of Gas Pipelines by Using of Indexing System Method in GIS environment

Nowadays the urbanization is developing rapidly, and it leads to growing demand for gas; which resulted in denser pipeline network, by the following increase in the pipeline network congestion, accidents  will become inevitable. So Pipelines are a remarkable source of hazard for their adjacent society. Usually Indexing system method is used for pipe line risk assessment. This method assesses ri...

متن کامل

Table Lookup via Direct Addressing: Key-Indexing, Bitmapping, Hashing

In SAS data processing, searching is one of the most frequent operations. Base SAS offers a rich collection of built-in searching techniques. MERGE, SQL joins, formats, SAS indexes all serve the purpose of the table lookup. For do-it-yourselfers, SAS offers arrays – directly addressable data structures suited for implementing just about any searching algorithm. An array-based lookup is not a r...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998